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Abstract 

Given a multivariate data set, sparse principal component analysis (SPCA) aims to extract 
several linear combinations of the variables that together explain the variance in the data as much 
as possible, while controlling the number of nonzero loadings in these combinations. In this paper 
we consider 8 different optimization formulations for computing a single sparse loading vector; 
these are obtained by combining the following factors: we employ two norms for measuring 
variance (L2, LI) and two sparsity-inducing norms (LO, LI), which arc used in two different 
ways (constraint, penalty). Three of our formulations, notably the one with LO constraint and 
LI variance, have not been considered in the literature. We give a unifying reformulation which 
wc propose to solve via a natural alternating maximization (AM) method. We show the the AM 
method is nontrivially equivalent to GPower (Journcc ct al; JMLR 11:517-553, 2010) for all 
our formulations. Besides this, we provide 24 efficient parallel SPCA implementations: 3 codes 
(multi-core, GPU and cluster) for each of the 8 problems. Parallelism in the methods is aimed 
at i) speeding up computations (our GPU code can be 100 times faster than an efficient serial 
code written in C++), ii) obtaining solutions explaining more variance and iii) dealing with big 
data problems (our cluster code is able to solve a 357 GB problem in about a minute). 

Keywords: sparse principal component analysis, alternating maximization, GPower, high 
performance computing, big data analytics, unsupervised learning. 

1 Introduction 

Principal component analysis (PCA) is an indispensable tool used for dimension reduction in vir- 
tually all areas of science and engineering, from machine learning, statistics, genetics and finance 
to computer networks [l]. Let A G j^"xp denote a data matrix encoding n samples (observations) 
of p variables (features). PCA aims to extract a few linear combinations of the columns of A, called 
principal components (PCs), pointing in mutually orthogonal directions, together explaining as 
much variance in the data as possible. If the columns of A are centered, the problem of extracting 
the first PC can be written as 

max{||Arc|| : ||2:||2 < 1}, (1.1) 
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where || • || is a suitable norm for measuring variance. The solution x of this optimization problem 
is called the loading vector, Ax (normalized) is the first PC. Further PCs can be obtained in the 
same way with A replaced by a new matrix in a process called deflation [2]. Classical PC A employs 
the L2 norm in the objective; using the Li norm instead may alleviate problems caused by outliers 
in the data and hence leads to a robust PCA model [3]. 

As normally there is no reason for the optimal loading vectors defining the PCs to be sparse, 
they are usually combinations of all of the variables. In some applications, however, sparse loading 
vectors enhance the interpretability of the components and are easier to store, which leads to the 
idea to induce sparsity in the loading vectors. This problem and approaches to it are known collec- 
tively as sparse PCA (SPCA); for some recent work, see [4]-|llj. A popular way of incorporating 
a sparsity-inducing mechanism into the above optimization formulation is via either a sparsity- 
inducing constraint or penalty. Two of the most popular functions for this are the Lq and Li norm 
of the loading vector x (the Lq "norm" of x, denoted by ||x||o, is the number of nonzeros in x). 

1.1 Eight optimization formulations 

In this paper we consider 8 optimization formulations for extracting a single sparse loading vector 
(i.e., for computing the first PC) arising as combinations of the following three modeling factors: 
we use two norms for measuring variance (classical L2 and robust Li) and two sparsity-inducing 
(SI) norms (cardinality Lq and Li), which are used in two different ways (as a constraint or a 
penalty). All have the form 

OPT = max /(x), (1.2) 

with X C and / detailed in Table 1. Note that if we set s = p in the constrained or 7 = in 
the penalized versions, the sparsity-inducing functions stop having any effeclll] and we recover the 
classical and robust PCA (jl.ip . Choosing l<s<p, 7>0 will have the effect of directly enforcing 
or indirectly encouraging sparsity in the solution x. 



# 


Variance 


SI norm 


SI norm usage 


X 




1 


L2 


Lo 


constraint 


{xeW : \\x\\2 < 1, llsllo < s} 


\\Ax\\2 


2 


Li 


Lq 


constraint 


{x e W : \\x\\2 < 1, llsllo < s} 


\\Ax\U 


3 


L2 


Li 


constraint 


{xelV : \\x\\2 < 1, ||s||i < ^} 


\\Ax\\2 


4 


Li 


Li 


constraint 


{xelV : \\x\\2 < 1, ll^lli < 


\\Axh 


5 


L2 


Lo 


penalty 


{xelV : \\x\\2 < 1} 


||Aa::||i -7||a;||o 


6 


Li 


Lo 


penalty 


{xeW : \\x\\2 < 1} 


\\Ax\\l - -f\\x\\o 


7 


L2 


Li 


penalty 


{x&W ■ \\x\\2 < 1} 


\\Ax\\2 - 'y\\x\\i 


8 


Li 


Li 


penalty 


{xeR." : \\x\\2 < 1} 


P^lli -7||x-||i 



Table 1: Eight sparse PCA optimization formulations; see 11.21 

All 4 SPCA formulations of Table 1 involving L2 variance were previously studied in the litera- 
ture and are very popular. For instance, [6] solve a series of convex relaxations of the Lq constrained 

""^In the Li penalized formulations this can be seen from the inequality ||a::||i < \/||s||o ||2;||2- 
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L2 variance problem, [7] considered the Lq penalized and constrained formulations, [9] studied the 
Lq and Li penalized versions, [10] looked at all four. The Li constrained Li variance formulation 
was first proposed only recently, in [11]. To the best of our knowledge, the remaining three Li vari- 
ance formulations were not considered in the literature before. In particular, the Lq constrained 
Li variance formulation is new — and is perhaps preferable as it directly constraints the cardinality 
of the loading vector x without using any proxies. 

1.2 Reformulation and alternating maximization (AM) method 

In all 8 formulations we introduce an additional (dummy) variable y, which allows us to propose 
a generic alternating maximization method for solving them: i) for a fixed loading vector, find the 
best dummy variable (one maximizing the objective), then ii) fix the dummy variable and find the 
best loading vector; repeat steps i) and ii). This and the resulting algorithms are described in 
detail in Section [2j The generic AM method is not limited to our choice of SPCA formulations. 
Indeed, it is applicable, for instance, if instead of measuring the variance using either the Li or 
the L2 norm, we use any other norm. One critical feature shared by the formulations in Table 1 is 
that steps i) and ii) of the AM method can be performed efficiently, in closed form, with the main 
computational burden in each step being a matrix-vector multiplication (Ax in step i) and A^y 
in step ii)). Our method produces a sequence of loading vectors x^''\ k > 0, with monotonically 
increasing values f{x^^^). 

Our approach of introducing a dummy variable and using AM is similar to that of [9], where it 
is done implicitly, but mainly to |12j . where it is fully explicit, albeit used for different purposes. 

Besides providing a conceptual unification for solving all 8 formulations using a single algorithm 
(AM), the main theoretical result of this paper is establishing that, surprisingly, in all 8 cases, the 
AM method is equivalent to the GPower method [9] applied to a certain derived objective function, 
with iterates being either the loading vectors or the dummy variables, depending on the formulation. 
This result is stated and proved in Section [3l 

1.3 Parallelism 

Besides giving a new unifying framework and a generic algorithm for solving a number of SPCA 
formulations, 5 of which were previously proposed in the literature and 3 not, our further contribu- 
tion is in providing efficient strategies for parallelizing AM at two different levels: i) running AM 
in parallel from multiple starting points in order to obtain a solution explaining more variance and 
ii) speeding up the linear algebra involved. This is described in detail in Section HI 

Moreover, we provide parallel open-source codes implementing these parallelization strategies, 
for each of our 8 formulations, on 3 computing architectures: i) multi-core machine, ii) GPU-enabled 
computer, and iii) computer cluster. We also provide a serial code; however, as nearly all modern 
computers are multi-core, the serial implementation only serves the purpose of a benchmark against 
which once can measure parallelization speedup. Hence, we provide a total of 8 x 3 = 24 parallel 
sparse PCA codes based on AM. Numerical experiments with our multi-core, GPU and cluster 
codes are performed in Section [5j 

Parallelism in our codes serves several purposes: 

1. Speeding up computations. As described above, the AM method computes a matrix- vector 
multiplication at every iteration; this can be parallelized. We find that our GPU implementa- 
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tions are faster than our multi-core implementations, which are, in turn, considerably faster 
than the benchmark single-core codes. 

2. Obtaining solutions explaining more variance. In some applications, such as in the computa- 
tion of RIP constants for compressed sensing [13] , it is critical that a PC is computed with as 
high explained variance as possible. The output of our 8 subroutines depends on the starting 
point used; it only finds local solutions. Running them repeatedly from different starting 
points and keeping the solution with the largest objective value results in a PC explaining 
more variance. There are several ways in which this can be done, we implement 4 (NAI = 
"naive", SFA = "start-from-all" , BAT = "batches" and OTF = "on-the-fly" ) ; details are 
given in Section [H A naive (NAI) approach is to do this sequentially; a different possibility 
is to run the method from several or all starting points in parallel (BAT, SFA), possibly 
asynchronously (OTF). This way at each iteration we need to perform a matrix-matrix mul- 
tiplication which, when computed in parallel, is performed significantly faster compared to 
doing the corresponding number of parallel matrix-vector multiplications, one after another. 

3. Dealing with big data problems. If speed matters, for problems of small enough size we 
recommend using a GPU, if available. Since CPUs have stricter memory limitations than 
multi-core workstations (a typical GPU has 6GB RAM, a multi-core machine could have 
20GB RAM), one may need to use a high- memory multi-core workstation if the problem size 
exceeds the GPU limit. However, for large enough (=big data) problems, one will need to 
use a cluster. Our cluster codes partition A, store parts of it on different nodes, and do the 
computations in a distributed way. 

Notation. By x and y we denote column vectors in and R", respectively. The coordinates 
of a vector are denoted by subscripts (eg., xi,X2, ■ ■ ■) while iterates are denoted by superscripts in 
brackets (eg., x^'^\ x^^\ . . .). We reserve the letter k for the iteration counter. By ||a;||o we refer 
to the cardinality (number of nonzero loadings) of vector x. The Li,L2 and L^o norms are defined 
by ||z||i = X^jl-^il, ||2;||2 = i^i^i)^^'^ ™d Halloo = niaxi|zi|, respectively. For a scalar t, we let 

= max{0,t} and by sgn(t) we denote the sign of t. 



2 Alternating Maximization (AM) Method 

As outlined in the previous section, we will solve ()1.2p by introducing a dummy variable y into each 
of the 8 formulations and apply an AM method to the reformulation. First, notice that for any 
pair of conjugate norms || • || and || • ||*, we have, by definition, 

||z|| = max y^z. (2.3) 

lls/ll*<i 

In particular, || • = || • II2 and || • \\l = \\ • ||oo- 

Now, let Y := {y £ R" : \\y\\2 < 1} for the L2 variance formulations and Y := {y € R" : 
II2/II00 < 1} for the Li variance formulations. Further, let F{x,y) be the function obtained from 
f{x) after replacing ||^x|| with y^ Ax (resp. with [y'^ Ax)"^). Then, in view of the above, 

(jl.2p takes on the equivalent form 

OPT = max max F(x, J/). (2-4) 
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That is, the 8 problems from Table [J can be reformulated into the form (j2.4p : the details can be 
found in Table [2j 



# 


X 


Y 


F{x,y) 


1 


{x eW : \\x\\2 < 1, llxllo < s} 


{yGR" : bll2< 1} 


y'^Ax 


2 


{x eW : \\x\\2 < 1, llxllo < s} 


{yGR" : b||oc<l} 


y'^Ax 


3 


{x G Rf : llslla < 1, ||2;||i < ^} 


{yGR" : ||y||2<l} 


y'^Ax 


4 


{x G Rf : llslla < 1, < ^s} 


{yen- : ||y||oo<l} 


y'^Ax 


5 


{x G R'' : ||2:||2 < 1} 


{j/GR" : bl|2<l} 


(y^Ax)2-7||x||o 


6 


{x G R'' : ||x||2 < 1} 


{yGR" : llylk<i} 


(y'^Ax)^ - 7x0 


7 


{x G R'' : ||x||2 < 1} 


{yGR" : l|y||2<i} 


i/^Ax-7||x||i 


8 


{x G R'' : ||x||2 < 1} 


{yGR" : l|y||oo<i} 


y^Ax — 7||x||i 



Table 2: Reformulations of the problems from Tabled) 
We propose to solve ()2.4p via Algorithm [TJ 

Algorithm 1 Alternating Maximization (AM) Method. 

Select initial point x^^^ G TV; k ^0 
Repeat 

yik) ^ y(^x^^^) := argmaxygy y) 
^(fc+i) ^ x{y^^^) := argmaxa;gx -^(3;, y^'^^) 
Until a stopping criterion is satisfied 



2.1 Solving the subproblems 

All 8 problems of Table [2] enjoy the property that both of the steps (subproblems) of Algorithm [1] 
can be computed in closed form. In particular, each of these 8x2 subproblems is of one of the 6 
forms listed in Table [3l 

The proofs of these elementary results, many of which are of folklore nature, can be found, for 
instance, in [10] (and partially in [9]). The columns of Table [3l from left to right, correspond to the 
objective function, feasible region, maximizer (optimal solution) and maximum (optimal objective 
value). The first result will be used both with z = x and z = y, the second result with z = y and 
the remaining four results with z = x. 

Table [3] is brief at the cost of referring to a number of operators (T^ ,U^, Vy : R"^ 1— )• R™ and 
Xg ■ R"' I— )• R), which we will now define. For a given vector a € R*" and integer s G {0, 1, . . . , m}, 
by Ts{a) G R™ we denote the vector obtained from a by retaining only the s largest components of 
a in absolute value, with the remaining ones replaced by zero. For instance, for a = (1, —4, 2, 5, 3)"^ 
and s = 2 we have Ts{a) = (0, —4, 0, 5, 0)"^. For 7 > 0, we define operators and element-wise 
for i = 1, . . . ,m as follows: 

(C/^(a))i := ai[sgn{af - 7)] + , (2.5) 
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Subproblem # 




Z 


2* 


0(2*) 


SI 


a z or (a z) 


PI|2<1 




ll«l|2 


S2 


T 

a z 


\\A\oo < 1 


sgn(a) 


Iklll 


S3 


T 

a z 


PII2 < 1, ||z||o < s 


l|T.(a)||2 


r.wii2 


S4 


T 

a z 


Wzh < 1, Pill < \/5 


II^A,(a)(<')ll2 


As(a)Vs + ||Vx^(a)(a)||2 


S5 


{a^zf - ^\\z\\o 


Pl|2< 1 


\\U-,{a)\\2 


||t/^(a)||i-7||t/^(a)||o 


S6 


z - 7 2: 1 


l|2||2< 1 


V~,{a) 
\\V-,(a)\\2 


r-y(«)ll2 



Table 3: Closed-form solutions of AM subproblems; z* := arg max^jg^ ^(z). 



{Vy{a))i := sgn(ai)(|ai| - 7)+. (2.6) 

Furthermore, we let 

Xs{a) := argminA^s + ||T^A(a)||2, 
which is the solution of the one-dimensional dual of the optimization problem in line 4 of Table [3l 

2.2 The AM method for all 8 SPCA formulations 

Combining Algorithm [1] with the subproblem solutions given in Table [3l the AM method for all 
our 8 SPCA formulations can be written down concisely; see Algorithm [2j 

Algorithm 2 AM method for solving the 8 SPCA formulations of Table [2j 

Select initial point x^^^ G R^; k ^0 
Repeat 
u = Ax'^''^ 

If Li variance then y^^^ ^ sgn(n) 

If L2 variance then y^^) ^ u/\\u\\2 

If Lo penalty then ^ U^{v) /\\U^{v)\\2 

If Li penalty then ^ (^')/ II ^7(^^)11 2 

If Lo constraint then ^ Ts{v) /\\Ts{v)\\2 

If Li constraint then x('=+^) ^ Vx^(^u){v) /\\Vx^{^,){v)\\2 
k^k + l 
Until a stopping criterion is satisfied 



Note that in the methods described in Algorithm [2] it is not necessary to normalize the vector 
U-y{v) (resp. V'y{v), Ts{v), and Vx^(^a)iv)) when computing x^'''^^^ since clearly the iterate y^^~^^\ 
which depends on x^''~^^\ is invariant under positive scalings of x^'^^-^^ We have to remember, 
however, to normalize the output. The method is terminated when a maximum number of iterations 
maxit is reached or when 
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whichever happens sooner. 



3 Equivalence of AM and GPower 

GPower (generahzed power method) [9] is a simple algorithm for maximizing a convex function ^ 
on a compact set $7, which works via a "linearize and maximize" strategy. If by ^'{z^^^) we denote 
an arbitrary subgradient of ^' at z^^^ , then GPower performs the following iteration: 

^(^=+1) = argmax{^(z('=)) + - z^'^))} = argmax(^'(zW), z). (3.7) 

The following theorem, our main result, gives a nontrivial insight into the relationship of AM 
and GPower, when the former is applied to solving any of the 8 SPCA formulations considered, 
and GPower is applied to a derived problem, as described by the theorem. 

Theorem 1 (AM = GPower). The AM and GPower methods are equivalent in the following sense: 

1. For the 4 constrained sparse PC A formulations of Tahle\^ the x iterates of the AM method ap- 
plied to the corresponding reformulation of Table are identical to the iterates of the GPower 
method as applied to the problem of maximizing the convex function 

Fy(x) =^ msxF(x,y) 

on X, started from x^^\ 

2. For the 4 penalized sparse PGA formulations of TableUl the y iterates of the AM method ap- 
plied to the corresponding reformulation of Table [H are identical to the iterates of the GPower 
method as applied to the problem of maximizing the convex function 

Fx{y) =^ max F{x,y) 

x&X 

on Y , started from y^^^ . 
Proof. Recall that we wish to solve the problem 

OPT = max f(x) = m&xm&xF(x,v) = maxmaxF(x,v) . 

x&X''^ ' x&X y&Y ^ ^' y&Y xf^X ^ 
Fy(x) Fx{y) 

We will now prove the equivalence for all 8 choices of (/, X, Y, F) given in Tables [1] and [2j In the 
proofs we will also refer to the closed form solutions of the subproblem (S1)-(S6), as detailed in 
Table El 

Consider first the constrained formulations: 1,2,3 and 4. By induction assume that the k-ih. 
x-iterate {x^^^) of AM is identical to the k-ih. iterate of GPower (for fc = this is enforced by the 
assumption that GPower is started from x^*^^). By considering all 4 formulations individually, we 
will show that x^^^^'^ produced by AM and GPower are also identical. 
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Formulation 1 : Here we have 

f{x) = \\Ax\\2, F{x,y) = y^Ax, 
X = {x G R^' : ||x||2 < 1, ||x||o < s}, y = {y G R" : \\y\\2 < 
First, note that 

(SI) 

Fy(x) = max F(x,y) = \\Ax\\2, 
the gradient of which is given by 

„i , , A. .Ax 
Given x^'^^ in the AM method we have 

(k) r./ (k) ^ (-51) 

y^ ' = arg max r [x^ ' , y) = 



One iteration of GPower started from x^^'^ will thus produce the iterate 

ffc+n ,w/ ffci\ \ EU A^Ax^^^ 

x^ ' = argmax(i^v(a^ i^x) = arg max ( — — ,,,,, . 

(13. 9t , .rp /U\ 

= argmax(A y^ ',x) 
(S3) T,(^^yW) 



X 



\\Ts{ATyi'^))\\2 

Observe that this is precisely how x^^^^^ is computed in the AM method. 
Formulation 2: Here we have 

f{x) = \\Ax\\i, F{x,y) = y'^Ax, 
X = {xeKP : \\x\\2 < 1, ||x||o <s}, y = {?/ G R" : \\y\\oo < 
First, note that 

Fy(rE) =maxF(x,y) ^ = ^ 

y£Y 

the gradient of which is given by 

Fy(x) = A^ sgn{Ax). 
Given x^''\ in the AM method we have 

yC") = argmaxFfx^'^^y) ^ = ^ sgn(Ax^''h. 

y€Y 



One iteration of GPower started from x^^^ will thus produce the iterate 

''^ argmax(Fy(x('')),2;) arg max (yl^ sgn(Ax^''^), x 

= arg max( A ' ,x) 



\\Ts{ATyik))y 

Observe that this is precisely how computed in the AM method. 

Formulation 3: Here we have 

/(x) = \\Ax\\2, F{x,y) = y^Ax, 
X = {x G : ||x||2 < 1, ||x||i < Vs}, Y = {yeTC : \\y\\2 < 1}. 
First, note that 

(SI) 

Fy(x) = maxF(x,y) = ||^x||2, 
the gradient of which is given by 

''^•<^> = ■ <^-'^> 

Given x^'^), in the AM method we have 

= argmaxF(xW,y) (3-13) 

One iteration of GPower started from x^*^^ will thus produce the iterate 

^ ^ = argmax(Fy(xU),x) = argmax ^p-^, x^ 

13.151 1 , .T^ (u\ , 

= arg max(^ y^ ' , x) 

(54) ^A,(A^y(fe))(-4^j/^^^^) 

II^A.(AT,W)W=))ll2' 

Observe that this is precisely how x^'^"'"^) is computed in the AM method. 
Formulation 4 ■ Here we have 

/(x) = ||Ax||i, F{x,y) = y^Ax, 
X = {x G RP : ||x||2 < 1, ||x||i < Vs}, y = {y G R" : ||y||oo < !}■ 
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First, note that 



(SI) 

Fy{x) = max F{x,y) = \\Ax\\i, 

y&Y 



the gradient of which is given by 

Given x^^\ in the AM method we have 

= argmaxF(xW,y) ^ = ^ „ f ■ (3.15) 

One iteration of GPower started from x^^'^ wih thus produce the iterate 

^ ^ = argmax(F^(.U),x) = argmax (^p-^, 

l|3.15| l , .rp IU\ . 

= argmax(A ',x) 

(aV^) 

II^A.(A-.W)(^V))I|2' 

Observe that this is precisely how a;^'^"'"^) is computed in the AM method. 

Consider now the penalized formulations: 5, 6, 7 and 8. By induction assume that the k-th 
y-iterate {y^^^) of AM is identical to the k-th. iterate of GPower (for A: = this is enforced by the 
assumption that GPower is started from y^^^). By considering ah 4 formulations individually, we 
win show that produced by AM and GPower are also identical. Let A = [ai, . . . ,ap], i.e., 

the i-th column of A is Oj. 

Formulation 5: Here we have 

fix) = WAxWl - iMo, F{x,y) = {y'^Axf - 7||x||o, 

X = {xeBP : \\x\\2 < 1}, y = {y G R" : Wyh < 1}. 

First, note that 

p 



Fx{y) = maxF(x,y) ^ = ^ \\U,{A'y)\\l - ^\\U,{A'y)\\o = Y.[{ajyf - 7]- 



i=l 

the subgradient of which is given by 

p 



F'xiv) = 2 5^[sgn((afy) -7)]+(afy)a, ^ 2AU,{A^y). (3.16) 
=1 
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Given y^^\ in the AM method we have 



One iteration of GPower started from y^*^) wih thus produce the iterate 

^ argmax(Fj.(y(*^)),y) ^ arg max (2 AU J A^y),y) 
yeY l|j/||cx,<i 

I I3.17I I , . , 

= arg max [Ax^ ,y) 
lly||2<i 

(51) Ax^^~^^^ 



\\Ax(''+^)\\2' 

Observe that this is precisely how computed in the AM method. 

Formulation 6: Here we have 

f{x) = \\Ax\\\ - 7||x||o, F{x,y) = {y'^Axf - 7||x||o, 
X = {xeW : ||x||2 < 1}, y = {y G R" : ||y||oo < 1}- 

First, note that 



Fx[y) = maxF(rE, y) \\U^{A^y)\\l - ^\\U^{A^y)h = Y^-'^aJyf - ^ 



1=1 

the subgradient of which is given by 

p 



F'xiv) = 2^[sgn((afy) -7)]+(afy)a, ^ 2AU,{A^y). 

i=l 



Given y^^\ in the AM method we have 

^(fc+i) _ argmaxFfrr v^^^ Uii^^V^'^) 

-argmaxi^(rE,y j - ||^^(^Ty(fc))||2 " 

One iteration of GPower started from y^*^) wiU thus produce the iterate 

y('=+^) ^ argmax(Fj.(y(^)),y) W arg max (2 AC/^M^y),y) 
y&y llj/lloo<i 

= arg max \Ax^ ,y) 

Il2/I|oo<l 

sgn{Ax^'+'^). 

Observe that this is precisely how 

y{k + l) ig 

computed in the AM method. 
11 



Formulation 7: Here we have 



f{x) = \\Ax\\2 - j\\x\\i, F{x,y) = y^Ax - 7||x||i, 

X = {xeRP : ||x||2 < 1}, y = {yGR" : ||y||2 < !}• 

Note that the functions y i— t- F{x, y) are hnear and that, by definition, Fx{y) = maxa;gx F{x^ u)- 
Moreover, note that the gradient of y i— )• F{x, y) at y is equal to Ax. Hence, if x is any vector 
that maximizes F(x,y^^^) over X, then Ax is a subgradient of Fx at y^^\ Note that this is 
precisely how x^'^^^^ is defined in the AM method: x^'^'^^^ = argmajCx^x F{x,y^''^). Hence, 
^jjC^+i) is a subgradient of Fx at y^^^ and one iteration of GPower started from y^'^^ will 
produce the iterate 

y{k+i) m argmax(Fjf(y('=)),y) = arg max {Ax^''+^\y) ^ = ^ „ f , ..^„ ■ 
yeY \\y\\2<i \\Ax(''+'-)\\2 

Observe that this is precisely how 

y{k + l) ig 

computed in the AM method. 

Formulation 8: Here we have 

f{x) = \\Ax\\i - 7||a;||i, F{x,y) = y'^Ax - 7||x||i, 

X = {x G : ||x||2 < 1}, y = {y G R" : ||y||oo < !}• 

Note that the functions y i— )■ F{x, y) are linear and that, by definition, Fx{y) = maxxi=x F{x, y). 
Moreover, note that the gradient of y i— )• F{x, y) at y is equal to Ax. Hence, if x is any vector 
that maximizes F{x,y^^^) over X, then Ax is a subgradient of Fx at y^^\ Note that this is 
precisely how x^^^^'' is defined in the AM method: x^^'^'^'^ = aiguiayix^x F{x,y^^^)- Hence, 
is a subgradient of Fx at y'^^^ and one iteration of GPower started from y^^^ will 
produce the iterate 

^(fc+i) m argmax(Fjf(y('=)),y) = arg max {Ax^^+^\y) ^ = ^ sgn{Ax^''+^^). 

y& \\y\\oo<i 

Observe that this is precisely how is computed in the AM method. 

□ 

Having established equivalence between AM and GPower, local convergence of the AM method 
for all 8 SPCA formulations follows from the theory developed in |S] and |1U| . 



4 Embedding AM within a Parallel Scheme 

In this section we describe several approaches for embedding Algorithm [2] (AM) within a par- 
allel scheme for solving / identical SPCA problems, started from a number of starting points, 
x^^'^\ . . . , x'-''''^ This is done in order to obtain a loading vector explaining more variance and will 
be discussed in more detail in Section 14.11 

As we will see, it may not necessarily be most efficient to solve all I problems simultaneously. 
Instead, we consider a class of parallelization schemes where we divide the / problems into "batches" 
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of r problems each, and solve each batch of r problems simultaneously. In this setting at each it- 
eration we need to perform identical operations in parallel, notably matrix-vector multiplications 
^x^'^'^), . . . , and A^yC^'i), . . . , A^y^'^^^'l It is useful to view the sequence of matrix-vector 

products as a single matrix-matrix product, e.g., A the first case, and use opti- 

mized libraries for parallelization. This simple trick leads to considerable speedups when compared 
to other approaches. We use similar ideas for the parallel evaluation of the operators. Note that 
even in the I = 1 case, i.e, if we wish to run SPCA from a single starting point only, there is scope 
for parallelization of the matrix-vector products and function evaluations. Hence, parallelization 
in our method serves two purposes: 

1. to obtain solutions explaining more variance by solving the problem from several starting 
points (we choose / > 1), 

2. to speed up computations by parallelizing the linear algebra involved (this applies to both 
/ = 1 and / > 1 cases). 



naive 
(NAI) 


start from all 
(SFA) 


batches 
(BAT) 


on the fly 
(OTF) 




6^ 




°0'' 


5 




a 

1 


2 

5 
1 


■ 


1 


3 







Figure 1: Four ways of embedding Algorithm 2 (AM) in a parallel scheme. In this example we run 
AM on the same problem I = 6 times, using different starting points. 

In particular, in this section we describe 4 parallelization approaches: 

• NAI = "naive" (r = 1), 

• SFA = "start-from-all" ( r = /), 

• BAT = "batches" {I < r < I) 

• OTF = "on-the-fly" (BAT improved by a dynamic replacement strategy to reduce idle time). 

The working of these 4 approaches is illustrated in Figure [1] in a situation with I = Q. In what 
follows we describe the methods informally, in a narrative style, with a suitable choice of numerical 
experiments illustrating the differences between the ideas. 

4.1 The hunt for more explained variance 

As shown in |9], |10] for GPower, and due to our equivalence theorem (Theorem 1), we know that 
Algorithm 2 (AM) is only able to converge to a local solution. Moreover, quality of the solution will 
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depend on the starting point (SP) x^^^ used. When the algorithm is run just once, the quahty of 
the obtained solution, in terms of the objective value (or explained variance), can be poor. Hence, 
if the amount of explained variance is important, it will be useful to run the method repeatedly 
from a number of different SPs. In this and all subsequent experiments we generated A S 
with independent and uniformly distributed entries from [—1, 1]. Here chose n = 512 and p = 2, 048 
(and renormalized the columns so that their norms are uniformly distributed on [0, 1]) and solved 
the corresponding Lq constrained L2 variance SPCA problem with s = 1,2,4,..., 2048. For each s 
we run AM from 1 = 1, 000 randomly generated SPs with maxit = 200 and tol = 10~^. The results 
are given in Figure [H where the vertical axis corresponds to the amount of explained variance of a 
particular solution compared to the best solution found. 



+ - ^ 



2 4 8 16 64 256 
Target sparsity level s 



1024 



Figure 2: It may be easy to converge to a poor local solution. 



Clearly, for small s it is easy to obtain a bad solution if we run the method only a few times; 
this effect is milder for large s but may be substantial nevertheless in real life problems. Hence, 
especially when s is small, it is necessary to employ a globalization strategy such as rerunning AM 
from a number of different starting points. This experiment illustrates that the simple strategy 
of running the method from a number of randomly generated starting points can be effective in 
finding solutions with more explained variance. A "naive" (NAI) approach would be to do this 
sequentially: solve the problem with one starting point first before solving it for another starting 
point. 



4.2 Economies of scale 

Running AM in parallel, started from a number of SPs, increases the utilization of computer 
resources, especially on parallel architectures. In order to demonstrate this, we generated 6 data 
matrices with p = 1000, 2000, . . . , 32000 and run the AM method for the Lq penalized L2 variance 
SPCA formulation with / = 256 SPs (and maxIt = 10). By BATr we denote the approach with 
batches of size r. Hence, SFA = BAT256 and NAI = BATl. Besides these two basic choices, we 
look at BAT4, BAT16 and BAT64 as well. The results can be found in Figure [3l 

Different problem sizes p appear on the horizontal axis; on the vertical axis we plot the speedup 
obtained by applying a particular batching strategy compared to NAI. Note that even on a single- 
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core computer (LEFT plot) we benefit from running the methods in parallel ("economies of scale") 
rather than running them one after another. Indeed, we can obtain a 2 — 3x speedup with BAT16 
across the whole range of problem sizes, and 4x speedup with SFA for large enough p. With 12 
cores (RIGHT plot) the effect is much more dramatic: the speedup for BAT16 is consistently in 
the 10 — 20 X range, and can even reach 50 x for SFA. 

4.3 Dynamic replacement 

It often happens, especially when batch size is large, that some problems within a batch converge 
sooner than others. The vanilla BAT approach described above does nothing about it, and continues 
through matrix-matrix multiplies, updating the already converged iterates, until the last problem 
in the batch converges. A minor but not negligible speedup is possible by employing an "on-the-fly" 
(OTF) dynamic replacement technique, where whenever a certain problem converges, it is replaced 
by a new one. Hence, no predefined batches exist — OTF can be viewed as a greedy list scheduling 
heuristic. We used I = 1024 starting points and compare SFA1024 with BAT64 and OTF64-the 
dynamic replacement variant of BAT64. 

Looking at the LEFT plot in Figure HI we see that the average number of iterations per starting 
point is much smaller for OTF. This results in speedup of more than 2x when compared with SFA 
(RIGHT plot). Notably, SFA is slower than both BAT64 and OTF64, which shows that it may 
not be optimal to choose r = I. 

5 Multi-core Processors, GPUs and Clusters 

Accompanying this paper is the open source software package "24am"[l implementing paralleliza- 
tion strategies described in Section HI all with Algorithm 2 (AM) used as the underlying solution 
method, with the option of using any of the 8 optimization formulations of SPCA described in 
Tabled) The name 24am comes from the fact that we implement the solver for 3 different parallel 

• ^https : //code . google ■ com /p/ 24am / 1 
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Figure 4: Dynamic Replacement: "On-the-fly" (OTF) is better than "Batches" (BAT), which is 
better than "start-from-ah" (SFA). 



architectures: multi-core processors, GPUs and computer clusters, leading to 24 = 8 x 3 methods 
based on AM. 

In this the rest of this section we first perform several numerical experiments illustrating the 
speedups obtained by parallelization on these three computing architectures. We then conclude 
with a real-life numerical example (large text corpora) and a few implementation remarks. 



5.1 Multi-core speedup 

Here we solve 9 random Li constrained Li variance SPCA instances of sizesp = 100x2*, i = 1, . . . , 9, 
n = p/10, with 100 SPs each, on a machine using 1, 2,4 and 8 cores; see Figure El 




Figure 5: Multi-core speedup is proportional to the number of cores. 

The plot on the LEFT shows the total computational time; the plot on the RIGHT shows the 
speedup of multi-core codes compared to the single-core code. Note that the speedup is consistently 
close to the number of cores for the 2 and 4-core setups across all problem sizes, and is growing 
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with p from 5x to about 7.5 x in the 8-core setup. 
5.2 GPU speedup 

Here we solve 8 random Li penalized Li variance SPCA instances with p varying roughly between 
10^ and 10^ and n = p/200. We solved all formulations with {1, 16, 256} SPs on a single-core CPU 
and a GPU; the results are shown in Figure El 




Figure 6: GPU code can achieve 125 x speedup compared to single-core when 256 starting points 
are used. 

The plot on the LEFT shows the total computational time. The red lines with triangle markers 
correspond to the single-core setup, the "higher" the line, the more starting points were used. The 
blue lines with square markers correspond to our GPU codes. While the runtime increases linearly 
with problem size for the single-core codes, it grows slowly for the GPU codes. Note that the GPU 
code may actually be slower for small problem sizes. Looking at the RIGHT plot, we see that the 
GPU code is capable of a 100-125 x speedup; this happens for large problem sizes and 256 SPs. 
The speedup can reach 100 x for 16 SPs as well. 

5.3 Cluster code 

In this experiment we solved several Li penalized L2 variance SPCA problems with a fully dense 
matrix A G R"^^'; the results are in Table [H We focus our discussion on the largest of the problems 
only (last three lines of the table), one with n = 6 x 10^ and p = 8 X 10^. We used a cluster of 800 
CPUs; storage of the data matrix required 357.6 GB of memory. The matrix was first loaded from 
files to memory; this process took ti = 92 seconds. Subsequently, the loaded data was distributed 
to CPUs where needed, which took additional ^2 = 713 seconds. Finally we run the AM method 
with 1, 32 and 64 starting points and measured the average time of a single iteration; the results 
are = 4.1, = 51.1 and tg = 134.9 seconds, respectively. When using a single starting point, the 
method would converge in about a minute. The column of Table |4] depicts the time it takes for 
the solver to perform k iterations. We treated the problem directly, without using any safe feature 
elimination techniques [H]. Such preprocessing could, however, be able to expand the reach of our 
cluster code to even larger problem sizes. 
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Table 4: Result from Cluster. For first 1 experiment the dimensions of virtual grid was the same 
as loaded data and hence t2 is small. For the rest we used 64x64 pieces. 

5.4 Large text corpora 

In the first experiment we tested the AM method with Lq constrained L2 variance formulation 
(with s = 5) on two medium-size data sets from the Machine Learning Repository [18]: news 
articles appeared in New York Times and abstracts of articles published in PubMed. Each data 
set is formatted as a matrix A G R"'^^', where the rows of A correspond to news articles in the 
NYTimes data set and to abstracts in PubMed, and the columns correspond to words. The number 
of appearances of word j in article or abstract i is the (i,j)-th entry of A; the matrices are hence 
clearly sparse. The NYTimes data set has 102,660 articles, 300,000 words, and approximately 
70 million nonzero entries. The PubMed data set contains 141,043 articles, 8.2 million words, and 
approximately 484 million nonzeroes. The matrices can be stored in 0.778 GB and 5.42 GB memory 
space, respectively. We have customized the AM method to exploit sparsity as much as possible. 
In Table [5] we present the first 5 sparse principal components (5 words each). Clearly, the first PC 
for NYT is about sports, the second about business, the third about elections, the fourth about 
education and the fifth about United States. Similar interpretations can be given to the PubMed 
PCs. 

5.5 Implementation details 

For single and multi-core architectures we developed our codes using the CBLAS interface. In 
particular, we use both the GSL BLAS and the Intel MKL [15] implementations (single-core) and 
the GotoBLAS2 |16j and Intel MKL implementations (multi-core). Parallelization in the multi-core 
case is performed by the OpenMP interface. When comparing the performance of single-core and 
multi-core architectures, we use Intel MKL library for both serial and parallel versions of the same 
algorithm for consistency. Nevertheless, in our experience, GotoBLAS2 implementation of these 
algorithms are faster than the Intel MKL implementation. We use CuBLAS version 4.0 [T7] on GPU 
(and make use of Thrust whenever possible for operations such as sorting, memory arrangements 
and data allocation on GPU). For comparisons between single-core and GPU architectures, we use 
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Table 5: First 5 sparse PCs for NYTimes and PubMed data sets. 

the GSL BLAS implementation on the single-core. On a cluster, linear algebra is done with Intel 
MKL's PBLAS, while communication between nodes is via MPI. 

6 Conclusion 

We propose a unifying framework for solving 8 SPCA formulations in which all have the same form 
and are solved by the same algorithm: the alternating maximization (AM) method. We observed 
that AM is in all cases equivalent to the GPower method applied to a suitable convex function. 
Five of these formulations were previously studied in the literature and three were not; notably 
the Li constrained Li (robust) variance seems to be new. For each of these formulations we have 
written 4 efficient codes — one serial and three parallel — aimed at single-core, multi-core and GPU 
workstations and a cluster. All these codes are enabled with efficient parallel implementations of 
a multiple-starting-point globalization strategy which aims to find PCs explaining more variance; 
with speedup per starting point achieving up to two orders of magnitude. The most efficient of 
these implementations is "on-the-fly" . We demonstrated that our cluster code is able to solve a 
very large problem with a 357 GB fully dense data matrix. 

Acknowledgements 

The work of P. Richtarik and M. Takac was supported by the EPSRC grant EP/I017127/1 (Mathe- 
matics for Vast Digital Resources) and the Centre for Numerical Algorithms and Intelligent Software 
(funded by EPSRC grant EP/G036136/1 and the Scottish Funding Council). P. Richtarik was also 
partially supported by the EPSRC grant EP/J020567/1 (Algorithms for Data Simplicity). S. D. 
Ahipa§aoglu was partially supported by the SUTD Start-Up Research Grant, Project Number: 
SRG ESD 2012 033. 

References 

[1] Jollife, I. (1986) Principal component analysis. Springer Verlag, NY. 



19 



[2] Mackey, L. (2008) Deflation Methods for Sparse PCA. Advances in Neural Information Pro- 
cessing Systems. 21:1017-1024. 

[3] Kwak, N. (2008) Principal component analysis based on Li norm maximization. IEEE Trans- 
actions on Pattern Analysis and Machine Intelligence 30:1672-1680. 

[4] Zou, H., Hastie, T. and Tibshirani, R. (2004) Sparse principal component analysis. Technical 
report, Stanford University. 

[5] Moghaddam, B., Weiss, Y. and Avidan, S., (2006) Spectral bounds for sparse PCA: exact and 
greedy algorithms. In: Advances in Neural Information Processing Systems 18:915-922. 

[6] d'Aspremont, A., El Ghaoui, L., Jordan, M.I. and Lanckriet, G.R.G. (2007) A direct formula- 
tion for sparse PCA using semidefinite programming. SIAM Review 48(3):434-448. 

[7] d'Aspremont, A., Bach, F. and El Ghaoui, L. (2008) Optimal solutions for sparse principal 
component analysis. Journal of Machine Learning Research 9:1269-1294. 

[8] Lu, Z.S. and Zhang, Y. (2009) An augmented lagrangian approach for sparse principal com- 
ponent analysis. Mathematical Programming, Series A. DOI: 10.1007/sl0107-011-0452-4. 

[9] Journee, M., Nesterov, Y., Richtarik, P. and Sepulchre, R. (2010) Generalized power method 
for sparse principal component analysis. Journal of Machine Learning Research 11:517-553. 

[10] Luss, R. and Teboulle, M. (2011) Conditional gradient algorithms for rank-one matrix approx- 
imations with a sparsity constraint. Technical report. 

[11] Meng, D., Zhao, Q. and Xu, Z. (2012) Improve robustness of sparse PCA by Li-norm maxi- 
mization. Pattern Recognition 45:487-497. 

[12] Richtarik, P. (2011) Finding sparse approximations to extreme eigenvectors: generalized power 
method for sparse PCA and extensions. In: Proceedings of Signal Processing with Adaptive 
Sparse Structured Representations. 

[13] Bah, B., Tanner, J. (2010) Improved bounds on restricted isometry constants for Gaussian 
matrices. SIAM Journal on Matrix Analysis and Applications 31:2882-2898. 

[14] Zhang, Y., El Ghaoui, L. (2011) Large-scale sparse principal component analysis with appli- 
cation to text data. In: Advances in Neural Information Processing Systems 24:532-539. 

[15] Intel Math Kernel Library (Intel MKL), 

http:/ / software.intel.com / en-us / articles / Intel- mkl / 

[16] GotoBLAS2, Texas Advanced Computing Center, 
http://www.tacc.utexas.edu/tacc-projects/gotoblas2 

[17] NVIDIA CUDA Basic Linear Algeb ra Subroutines (CuBLAS), 
http:/ / developer.nvidia.com/cublas 

[18] David Newman, Bag of Words Data Set, University of California, Irvine, 
http : // archive . ics . uci.edu /ml/ dat asets/B ag-|-of-|- Words 



20 



